Predictive Analytics for Solar Power Generation: Developing Machine Learning Models Using Meteorological Data¶

AI 221 project by:

Daniel De Castro (ddecastro2@up.edu.ph)

University of the Philippines Diliman


Problem Statement¶

The fluctuating nature of solar power generation, primarily due to varying environmental conditions, presents significant challenges in energy management and grid integration. Existing prediction models often do not fully account for the complex interactions of meteorological factors, leading to inaccuracies in solar power output forecasts. This unpredictability impacts the efficiency of energy distribution and the reliability of solar power as a consistent energy source. An improved predictive model that comprehensively considers various environmental influences is crucial for enhancing the predictability and utility of solar energy.

Project Objective¶

The goal of this project is to harness the power of machine learning to develop robust models that accurately predict solar power generation from meteorological data. The objectives are as follows:

  1. Data Analysis and Feature Engineering: Conduct an in-depth analysis of the 'Solar Energy Power Generation' dataset. Identify key environmental factors and engineer features that effectively capture the dynamics impacting solar power output.

  2. Model Development: Explore and implement a range of machine learning algorithms, including but not limited to, linear regression, ensemble methods, and neural networks, to assess their suitability for this prediction task.

  3. Model Training and Validation: Utilize cross-validation strategies to train and fine-tune the models. Ensure that they not only fit the training data well but also generalize effectively to new, unseen data.

  4. Performance Evaluation: Evaluate the performance of each model using standard regression metrics like RMSE (Root Mean Squared Error), MAE (Mean Absolute Error), and R² (Coefficient of Determination). The best model will be selected based on its performance on these metrics.

  5. Insights and Recommendations: Provide insights into the impact of various meteorological factors on solar power generation. Based on the findings, offer practical recommendations for optimizing solar energy usage and future research directions.

By achieving these objectives, this project aims to significantly improve the predictability and efficiency of solar energy, thereby bolstering its viability as a sustainable energy resource.
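For reference, the evaluation metrics named in objective 4 can be written in standard form (with $y_i$ the observed power, $\hat{y}_i$ the prediction, $\bar{y}$ the mean observed value, and $n$ the number of test samples):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2},\qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,\qquad R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$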

In [ ]:
import pandas as pd
In [ ]:
RANDOM_SEED = 43  # for reproducibility

Dataset Overview¶

The Kaggle 'Solar Energy Power Generation' dataset provides comprehensive data on solar power generation along with the meteorological factors that can affect it, making it a valuable resource for analyzing and predicting solar power output from environmental conditions.

In [ ]:
dataset = pd.read_csv("spg.csv")
dataset.head()
Out[ ]:
temperature_2_m_above_gnd relative_humidity_2_m_above_gnd mean_sea_level_pressure_MSL total_precipitation_sfc snowfall_amount_sfc total_cloud_cover_sfc high_cloud_cover_high_cld_lay medium_cloud_cover_mid_cld_lay low_cloud_cover_low_cld_lay shortwave_radiation_backwards_sfc ... wind_direction_10_m_above_gnd wind_speed_80_m_above_gnd wind_direction_80_m_above_gnd wind_speed_900_mb wind_direction_900_mb wind_gust_10_m_above_gnd angle_of_incidence zenith azimuth generated_power_kw
0 2.17 31 1035.0 0.0 0.0 0.0 0 0 0 0.00 ... 312.71 9.36 22.62 6.62 337.62 24.48 58.753108 83.237322 128.33543 454.10095
1 2.31 27 1035.1 0.0 0.0 0.0 0 0 0 1.78 ... 294.78 5.99 32.74 4.61 321.34 21.96 45.408585 75.143041 139.65530 1411.99940
2 3.65 33 1035.4 0.0 0.0 0.0 0 0 0 108.58 ... 270.00 3.89 56.31 3.76 286.70 14.04 32.848282 68.820648 152.53769 2214.84930
3 5.82 30 1035.4 0.0 0.0 0.0 0 0 0 258.10 ... 323.13 3.55 23.96 3.08 339.44 19.80 22.699288 64.883536 166.90159 2527.60920
4 7.73 27 1034.4 0.0 0.0 0.0 0 0 0 375.58 ... 10.01 6.76 25.20 6.62 22.38 16.56 19.199908 63.795208 182.13526 2640.20340

5 rows × 21 columns

The dataset contains several columns, each representing different environmental and solar energy-related measurements. Key columns include:

In [ ]:
dataset.describe().T
Out[ ]:
count mean std min 25% 50% 75% max
temperature_2_m_above_gnd 4213.0 15.068111 8.853677 -5.350000 8.390000 14.750000 21.290000 34.90000
relative_humidity_2_m_above_gnd 4213.0 51.361025 23.525864 7.000000 32.000000 48.000000 70.000000 100.00000
mean_sea_level_pressure_MSL 4213.0 1019.337812 7.022867 997.500000 1014.500000 1018.100000 1023.600000 1046.80000
total_precipitation_sfc 4213.0 0.031759 0.170212 0.000000 0.000000 0.000000 0.000000 3.20000
snowfall_amount_sfc 4213.0 0.002808 0.038015 0.000000 0.000000 0.000000 0.000000 1.68000
total_cloud_cover_sfc 4213.0 34.056990 42.843638 0.000000 0.000000 8.700000 100.000000 100.00000
high_cloud_cover_high_cld_lay 4213.0 14.458818 30.711707 0.000000 0.000000 0.000000 9.000000 100.00000
medium_cloud_cover_mid_cld_lay 4213.0 20.023499 36.387948 0.000000 0.000000 0.000000 10.000000 100.00000
low_cloud_cover_low_cld_lay 4213.0 21.373368 38.013885 0.000000 0.000000 0.000000 10.000000 100.00000
shortwave_radiation_backwards_sfc 4213.0 387.759036 278.459293 0.000000 142.400000 381.810000 599.860000 952.30000
wind_speed_10_m_above_gnd 4213.0 16.228787 9.876948 0.000000 9.010000 14.460000 21.840000 61.18000
wind_direction_10_m_above_gnd 4213.0 195.078452 106.626782 0.540000 153.190000 191.770000 292.070000 360.00000
wind_speed_80_m_above_gnd 4213.0 18.978483 11.999960 0.000000 10.140000 16.240000 26.140000 66.88000
wind_direction_80_m_above_gnd 4213.0 191.166862 108.760021 1.120000 130.240000 187.770000 292.040000 360.00000
wind_speed_900_mb 4213.0 16.363190 9.885330 0.000000 9.180000 14.490000 21.970000 61.11000
wind_direction_900_mb 4213.0 192.447911 106.516195 1.120000 148.220000 187.990000 288.000000 360.00000
wind_gust_10_m_above_gnd 4213.0 20.583489 12.648899 0.720000 11.160000 18.000000 27.000000 84.96000
angle_of_incidence 4213.0 50.837490 26.638965 3.755323 29.408181 47.335557 69.197492 121.63592
zenith 4213.0 59.980947 19.857711 17.727761 45.291631 62.142611 74.346737 128.41537
azimuth 4213.0 169.167651 64.568385 54.379093 114.136600 163.241650 225.085620 289.04518
generated_power_kw 4213.0 1134.347313 937.957247 0.000595 231.700450 971.642650 2020.966700 3056.79410
  • temperature_2_m_above_gnd: Temperature measured 2 meters above the ground. It ranges from -5.35°C to 34.9°C with an average of 15.07°C.

  • relative_humidity_2_m_above_gnd: Relative humidity measured 2 meters above the ground, ranging from 7% to 100%, with an average of 51.36%.

  • mean_sea_level_pressure_MSL: Mean sea level pressure in millibars, varying from 997.5 to 1046.8, with an average of 1019.34.

  • total_precipitation_sfc: Total precipitation at the surface level, ranging from 0 to 3.2 mm, with an average of 0.03 mm.

  • snowfall_amount_sfc: Snowfall amount at the surface level, ranging from 0 to 1.68 mm, with an average of 0.003 mm.

  • total_cloud_cover_sfc: Total cloud cover percentage at the surface level, ranging from 0% to 100%, with an average of 34.06%.

  • high_cloud_cover_high_cld_lay: High cloud cover percentage in the high cloud layer, ranging from 0% to 100%, with an average of 14.46%.

  • medium_cloud_cover_mid_cld_lay: Medium cloud cover percentage in the mid cloud layer, ranging from 0% to 100%, with an average of 20.02%.

  • low_cloud_cover_low_cld_lay: Low cloud cover percentage in the low cloud layer, ranging from 0% to 100%, with an average of 21.37%.

  • shortwave_radiation_backwards_sfc: Shortwave radiation backwards at the surface level in watts per square meter, ranging from 0 to 952.3, with an average of 387.76.

  • wind_speed_10_m_above_gnd: Wind speed measured 10 meters above the ground in km/h, ranging from 0 to 61.18, with an average of 16.23.

  • wind_direction_10_m_above_gnd: Wind direction measured 10 meters above the ground in degrees, ranging from 0° to 360°, with an average of 195.08°.

  • wind_speed_80_m_above_gnd: Wind speed measured 80 meters above the ground in km/h, ranging from 0 to 66.88, with an average of 18.98.

  • wind_direction_80_m_above_gnd: Wind direction measured 80 meters above the ground in degrees, ranging from 1.12° to 360°, with an average of 191.17°.

  • wind_speed_900_mb: Wind speed at 900 millibars pressure level in km/h, ranging from 0 to 61.11, with an average of 16.36.

  • wind_direction_900_mb: Wind direction at 900 millibars pressure level in degrees, ranging from 1.12° to 360°, with an average of 192.45°.

  • wind_gust_10_m_above_gnd: Wind gust speed measured 10 meters above the ground in km/h, ranging from 0.72 to 84.96, with an average of 20.58.

  • angle_of_incidence: The angle of incidence in degrees, ranging from 3.76° to 121.64°, with an average of 50.84°.

  • zenith: The zenith angle in degrees, ranging from 17.73° to 128.42°, with an average of 59.98°.

  • azimuth: The azimuth angle in degrees, ranging from 54.38° to 289.05°, with an average of 169.17°.

  • generated_power_kw: The generated power in kilowatts, ranging from approximately 0 kW to 3056.79 kW, with an average of 1134.35 kW.

Data Preprocessing¶

In [ ]:
# Handling missing values, if any
data = dataset.dropna()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4213 entries, 0 to 4212
Data columns (total 21 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   temperature_2_m_above_gnd          4213 non-null   float64
 1   relative_humidity_2_m_above_gnd    4213 non-null   int64  
 2   mean_sea_level_pressure_MSL        4213 non-null   float64
 3   total_precipitation_sfc            4213 non-null   float64
 4   snowfall_amount_sfc                4213 non-null   float64
 5   total_cloud_cover_sfc              4213 non-null   float64
 6   high_cloud_cover_high_cld_lay      4213 non-null   int64  
 7   medium_cloud_cover_mid_cld_lay     4213 non-null   int64  
 8   low_cloud_cover_low_cld_lay        4213 non-null   int64  
 9   shortwave_radiation_backwards_sfc  4213 non-null   float64
 10  wind_speed_10_m_above_gnd          4213 non-null   float64
 11  wind_direction_10_m_above_gnd      4213 non-null   float64
 12  wind_speed_80_m_above_gnd          4213 non-null   float64
 13  wind_direction_80_m_above_gnd      4213 non-null   float64
 14  wind_speed_900_mb                  4213 non-null   float64
 15  wind_direction_900_mb              4213 non-null   float64
 16  wind_gust_10_m_above_gnd           4213 non-null   float64
 17  angle_of_incidence                 4213 non-null   float64
 18  zenith                             4213 non-null   float64
 19  azimuth                            4213 non-null   float64
 20  generated_power_kw                 4213 non-null   float64
dtypes: float64(17), int64(4)
memory usage: 691.3 KB

Exploratory Data Analysis (EDA)¶

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns

Conduct a correlation analysis to identify pairs of variables that have a strong linear relationship. High correlation (both positive and negative) can indicate interesting pairs to visualize.

In [ ]:
correlation_matrix = data.corr()
correlation_matrix
Out[ ]:
temperature_2_m_above_gnd relative_humidity_2_m_above_gnd mean_sea_level_pressure_MSL total_precipitation_sfc snowfall_amount_sfc total_cloud_cover_sfc high_cloud_cover_high_cld_lay medium_cloud_cover_mid_cld_lay low_cloud_cover_low_cld_lay shortwave_radiation_backwards_sfc ... wind_direction_10_m_above_gnd wind_speed_80_m_above_gnd wind_direction_80_m_above_gnd wind_speed_900_mb wind_direction_900_mb wind_gust_10_m_above_gnd angle_of_incidence zenith azimuth generated_power_kw
temperature_2_m_above_gnd 1.000000 -0.771704 -0.402240 -0.083137 -0.121422 -0.326641 -0.019522 -0.100980 -0.381876 0.665755 ... 0.051393 -0.244869 0.086630 -0.198107 0.043233 -0.188264 -0.090173 -0.545646 0.381797 0.217280
relative_humidity_2_m_above_gnd -0.771704 1.000000 0.100529 0.168660 0.113987 0.402895 0.056452 0.135347 0.490402 -0.721754 ... 0.008902 0.212868 -0.019408 0.135464 0.021068 0.144807 0.268460 0.513748 -0.525760 -0.336783
mean_sea_level_pressure_MSL -0.402240 0.100529 1.000000 -0.159098 -0.053871 -0.151995 -0.014646 -0.129812 -0.162043 -0.188387 ... -0.119867 -0.131442 -0.161020 -0.145696 -0.125234 -0.189266 -0.075619 0.268111 -0.137872 0.150551
total_precipitation_sfc -0.083137 0.168660 -0.159098 1.000000 0.184497 0.223678 0.076255 0.262367 0.282748 -0.130358 ... 0.005234 0.052376 0.007131 0.044797 0.003216 0.066701 -0.020965 -0.023408 0.005749 -0.118442
snowfall_amount_sfc -0.121422 0.113987 -0.053871 0.184497 1.000000 0.112646 -0.026356 0.042867 0.151609 -0.073499 ... 0.039734 0.093156 0.041246 0.100405 0.041716 0.093060 -0.012497 0.033554 0.008426 -0.049508
total_cloud_cover_sfc -0.326641 0.402895 -0.151995 0.223678 0.112646 1.000000 0.442865 0.712077 0.746225 -0.345089 ... 0.055057 0.183732 0.039671 0.174510 0.057816 0.212142 -0.003426 0.136249 -0.037427 -0.334338
high_cloud_cover_high_cld_lay -0.019522 0.056452 -0.014646 0.076255 -0.026356 0.442865 1.000000 0.593300 0.024703 -0.089620 ... 0.017688 0.090049 0.018228 0.078204 0.020897 0.092842 -0.033840 0.031766 0.020790 -0.147723
medium_cloud_cover_mid_cld_lay -0.100980 0.135347 -0.129812 0.262367 0.042867 0.712077 0.593300 1.000000 0.236716 -0.199843 ... 0.016954 0.088972 0.021935 0.076192 0.017195 0.079627 -0.035511 0.046719 0.014802 -0.227834
low_cloud_cover_low_cld_lay -0.381876 0.490402 -0.162043 0.282748 0.151609 0.746225 0.024703 0.236716 1.000000 -0.336751 ... 0.040060 0.156204 0.021782 0.153578 0.039875 0.193846 0.013421 0.120854 -0.054328 -0.288066
shortwave_radiation_backwards_sfc 0.665755 -0.721754 -0.188387 -0.130358 -0.073499 -0.345089 -0.089620 -0.199843 -0.336751 1.000000 ... -0.076530 -0.077090 -0.051670 0.028929 -0.081545 0.017212 -0.576921 -0.801892 0.549296 0.556148
wind_speed_10_m_above_gnd -0.172532 0.109674 -0.170199 0.044384 0.103749 0.175869 0.069620 0.069307 0.161919 0.078791 ... -0.035788 0.957745 -0.005156 0.992851 -0.017289 0.898893 -0.173060 -0.041168 0.194680 -0.083043
wind_direction_10_m_above_gnd 0.051393 0.008902 -0.119867 0.005234 0.039734 0.055057 0.017688 0.016954 0.040060 -0.076530 ... 1.000000 -0.023300 0.891487 -0.046880 0.930226 0.059981 0.054676 0.044775 0.009908 -0.073257
wind_speed_80_m_above_gnd -0.244869 0.212868 -0.131442 0.052376 0.093156 0.183732 0.090049 0.088972 0.156204 -0.077090 ... -0.023300 1.000000 0.005862 0.969352 -0.003115 0.898347 -0.049618 0.091319 0.064278 -0.157899
wind_direction_80_m_above_gnd 0.086630 -0.019408 -0.161020 0.007131 0.041246 0.039671 0.018228 0.021935 0.021782 -0.051670 ... 0.891487 0.005862 1.000000 -0.014577 0.919390 0.065285 0.051170 0.029259 0.017849 -0.069941
wind_speed_900_mb -0.198107 0.135464 -0.145696 0.044797 0.100405 0.174510 0.078204 0.076192 0.153578 0.028929 ... -0.046880 0.969352 -0.014577 1.000000 -0.026721 0.894006 -0.136442 0.004675 0.155932 -0.107615
wind_direction_900_mb 0.043233 0.021068 -0.125234 0.003216 0.041716 0.057816 0.020897 0.017195 0.039875 -0.081545 ... 0.930226 -0.003115 0.919390 -0.026721 1.000000 0.071530 0.056517 0.048158 -0.000427 -0.077435
wind_gust_10_m_above_gnd -0.188264 0.144807 -0.189266 0.066701 0.093060 0.212142 0.092842 0.079627 0.193846 0.017212 ... 0.059981 0.898347 0.065285 0.894006 0.071530 1.000000 -0.122335 -0.006612 0.152166 -0.122808
angle_of_incidence -0.090173 0.268460 -0.075619 -0.020965 -0.012497 -0.003426 -0.033840 -0.035511 0.013421 -0.576921 ... 0.054676 -0.049618 0.051170 -0.136442 0.056517 -0.122335 1.000000 0.712773 -0.288647 -0.646537
zenith -0.545646 0.513748 0.268111 -0.023408 0.033554 0.136249 0.031766 0.046719 0.120854 -0.801892 ... 0.044775 0.091319 0.029259 0.004675 0.048158 -0.006612 0.712773 1.000000 -0.247447 -0.649991
azimuth 0.381797 -0.525760 -0.137872 0.005749 0.008426 -0.037427 0.020790 0.014802 -0.054328 0.549296 ... 0.009908 0.064278 0.017849 0.155932 -0.000427 0.152166 -0.288647 -0.247447 1.000000 -0.061184
generated_power_kw 0.217280 -0.336783 0.150551 -0.118442 -0.049508 -0.334338 -0.147723 -0.227834 -0.288066 0.556148 ... -0.073257 -0.157899 -0.069941 -0.107615 -0.077435 -0.122808 -0.646537 -0.649991 -0.061184 1.000000

21 rows × 21 columns

In [ ]:
# Correlation matrix heatmap
plt.figure(figsize=(12, 12))
sns.heatmap(correlation_matrix, annot=True)
plt.show()

Identify the top correlations

Positive Correlations:

In [ ]:
# Flatten the correlation matrix and exclude self-correlations
correlation_pairs = correlation_matrix.unstack()
positive_sorted_pairs = correlation_pairs.sort_values(kind="quicksort", ascending=False)

# Exclude self correlations (correlation of a variable with itself will always be 1)
positive_no_self_correlation = positive_sorted_pairs[positive_sorted_pairs != 1]

# Display top positively correlated pairs
positive_no_self_correlation.head(10)
Out[ ]:
wind_speed_10_m_above_gnd      wind_speed_900_mb                0.992851
wind_speed_900_mb              wind_speed_10_m_above_gnd        0.992851
                               wind_speed_80_m_above_gnd        0.969352
wind_speed_80_m_above_gnd      wind_speed_900_mb                0.969352
                               wind_speed_10_m_above_gnd        0.957745
wind_speed_10_m_above_gnd      wind_speed_80_m_above_gnd        0.957745
wind_direction_900_mb          wind_direction_10_m_above_gnd    0.930226
wind_direction_10_m_above_gnd  wind_direction_900_mb            0.930226
wind_direction_900_mb          wind_direction_80_m_above_gnd    0.919390
wind_direction_80_m_above_gnd  wind_direction_900_mb            0.919390
dtype: float64

Negative Correlations:

In [ ]:
negative_sorted_pairs = correlation_pairs.sort_values(kind="quicksort", ascending=True)

# Exclude self correlations (correlation of a variable with itself will always be 1)
negative_no_self_correlation = negative_sorted_pairs[negative_sorted_pairs != 1]

# Display top negatively correlated pairs
negative_no_self_correlation.head(10)
Out[ ]:
zenith                             shortwave_radiation_backwards_sfc   -0.801892
shortwave_radiation_backwards_sfc  zenith                              -0.801892
temperature_2_m_above_gnd          relative_humidity_2_m_above_gnd     -0.771704
relative_humidity_2_m_above_gnd    temperature_2_m_above_gnd           -0.771704
                                   shortwave_radiation_backwards_sfc   -0.721754
shortwave_radiation_backwards_sfc  relative_humidity_2_m_above_gnd     -0.721754
zenith                             generated_power_kw                  -0.649991
generated_power_kw                 zenith                              -0.649991
angle_of_incidence                 generated_power_kw                  -0.646537
generated_power_kw                 angle_of_incidence                  -0.646537
dtype: float64

Top correlations

In [ ]:
# combine negative and positive correlations
top_correlated_pairs = pd.concat(
    [positive_no_self_correlation.head(10), negative_no_self_correlation.head(11)]
)
top_correlated_pairs
Out[ ]:
wind_speed_10_m_above_gnd          wind_speed_900_mb                    0.992851
wind_speed_900_mb                  wind_speed_10_m_above_gnd            0.992851
                                   wind_speed_80_m_above_gnd            0.969352
wind_speed_80_m_above_gnd          wind_speed_900_mb                    0.969352
                                   wind_speed_10_m_above_gnd            0.957745
wind_speed_10_m_above_gnd          wind_speed_80_m_above_gnd            0.957745
wind_direction_900_mb              wind_direction_10_m_above_gnd        0.930226
wind_direction_10_m_above_gnd      wind_direction_900_mb                0.930226
wind_direction_900_mb              wind_direction_80_m_above_gnd        0.919390
wind_direction_80_m_above_gnd      wind_direction_900_mb                0.919390
zenith                             shortwave_radiation_backwards_sfc   -0.801892
shortwave_radiation_backwards_sfc  zenith                              -0.801892
temperature_2_m_above_gnd          relative_humidity_2_m_above_gnd     -0.771704
relative_humidity_2_m_above_gnd    temperature_2_m_above_gnd           -0.771704
                                   shortwave_radiation_backwards_sfc   -0.721754
shortwave_radiation_backwards_sfc  relative_humidity_2_m_above_gnd     -0.721754
zenith                             generated_power_kw                  -0.649991
generated_power_kw                 zenith                              -0.649991
angle_of_incidence                 generated_power_kw                  -0.646537
generated_power_kw                 angle_of_incidence                  -0.646537
shortwave_radiation_backwards_sfc  angle_of_incidence                  -0.576921
dtype: float64

Visualize the top correlations

In [ ]:
# Extracting the column names for each pair
pairs = [(index[0], index[1]) for index in top_correlated_pairs.index]

# Plotting each pair
# Determine the number of rows and columns for the subplot grid
n_rows = 7
n_cols = 3

# Create a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 20))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plotting each pair in the grid
for i, (x, y) in enumerate(pairs):
    sns.scatterplot(x=x, y=y, data=data, ax=axes[i], alpha=0.5)

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

Visualize correlations of each variable with the target variable

In [ ]:
target_variable = "generated_power_kw"

# Extracting all feature names except the target variable
feature_names = [col for col in data.columns if col != target_variable]

# Determine the number of rows and columns for the subplot grid
n_rows = 7
n_cols = 3

# Create a grid of subplots
fig, axes = plt.subplots(n_rows, n_cols, figsize=(10, 20))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plotting each feature against the target variable
for i, feature in enumerate(feature_names):
    sns.scatterplot(x=feature, y=target_variable, data=data, ax=axes[i], alpha=0.5)

# Adjust layout for better spacing
plt.tight_layout()
plt.show()

General Insights:¶

  • Temperature and Power Generation: The correlation between temperature_2_m_above_gnd and generated_power_kw is positive (around 0.22), suggesting that higher temperatures are associated with increased solar power generation. This is most likely a proxy effect: warmer hours tend to coincide with more intense sunlight, rather than heat itself improving panel output.

  • Humidity and Cloud Cover: Both relative_humidity_2_m_above_gnd and total_cloud_cover_sfc correlate negatively with generated_power_kw (around -0.34 each), indicating that higher humidity and more cloud cover are associated with lower solar power output. This is expected, as clouds and humidity reduce the solar radiation reaching the panels.

  • Shortwave Radiation: A strong positive correlation (around 0.56) is observed between shortwave_radiation_backwards_sfc and generated_power_kw. This is quite intuitive as more solar radiation directly translates to higher potential for solar power generation.

  • Wind Speed: The correlations between the different wind speed measurements (wind_speed_10_m_above_gnd, wind_speed_80_m_above_gnd, wind_speed_900_mb) and generated_power_kw are slightly negative, suggesting that higher wind speeds do not contribute to increased solar power generation. This is consistent with wind having no direct effect on incoming solar radiation; windier conditions may simply coincide with cloudier weather.

  • Angle of Incidence and Zenith: Both angle_of_incidence and zenith have strong negative correlations with generated_power_kw (around -0.65 for both). This indicates that the position of the sun (angle and zenith) plays a significant role in power generation, likely due to the varying intensity of solar radiation throughout the day.

The correlation matrix reveals meaningful relationships between various meteorological factors and solar power generation. Key factors like temperature, shortwave radiation, cloud cover, and sun positioning (angle of incidence and zenith) show significant correlations with power generation, aligning well with the fundamental principles of solar energy. Understanding these relationships can guide more detailed analyses, especially for predicting solar power generation based on weather conditions.
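The insights above can be condensed into a single ranking of features by their absolute correlation with the target. A minimal sketch of the pattern, using a small synthetic frame in place of `data` (the column names `radiation` and `cloud_cover` and their coefficients are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame standing in for `data`; the same pattern
# applies unchanged to the full dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "radiation": rng.uniform(0, 900, 100),
    "cloud_cover": rng.uniform(0, 100, 100),
})
demo["generated_power_kw"] = (
    3 * demo["radiation"] - 5 * demo["cloud_cover"] + rng.normal(0, 50, 100)
)

# Rank features by absolute correlation with the target
target_corr = (
    demo.corr()["generated_power_kw"]
    .drop("generated_power_kw")
    .abs()
    .sort_values(ascending=False)
)
print(target_corr)
```

On the real dataset, this one-liner reproduces the ranking discussed above (shortwave radiation, zenith, and angle of incidence at the top) without scanning the full 21×21 matrix.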

Feature Selection and Reduction¶

In [ ]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
In [ ]:
# Separating features and target variable
X = data.drop(target_variable, axis=1)
y = data[target_variable]

Standardizing the features

In [ ]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Using PCA to determine the appropriate number of principal components to keep

In [ ]:
pca = PCA(random_state=RANDOM_SEED)
X_pca = pca.fit_transform(X_scaled)

Get the Cumulative Proportion of Variance (CPV) explained by each component

In [ ]:
cpv = np.cumsum(pca.explained_variance_ratio_)
cpv
Out[ ]:
array([0.22490081, 0.41335927, 0.55830037, 0.66921138, 0.73559625,
       0.79702998, 0.8447486 , 0.88669961, 0.92465582, 0.94380398,
       0.96143295, 0.96998183, 0.97707577, 0.98311315, 0.98810758,
       0.9920493 , 0.99529021, 0.99830264, 0.9997545 , 1.        ])
In [ ]:
# Plotting the CPV
plt.figure(figsize=(10, 6))
plt.plot(cpv, marker="o", linestyle="--")
plt.xlabel("Number of Principal Components")
plt.ylabel("Cumulative Proportion of Variance Explained")
plt.title("Cumulative Proportion of Variance (CPV) Explained by PCA Components")
plt.grid()
plt.show()

Based on the CPV, the first 11 components explain at least 95% of the variance in the data. This means we can reduce the number of features from 20 to 11 without losing much information.

In [ ]:
# determine the number of components that explain at least 95% of the variance
n_components = np.where(cpv >= 0.95)[0][0] + 1
n_components
Out[ ]:
11

Using PCA to reduce the number of features to 11

In [ ]:
pca_reduced = PCA(n_components=n_components, random_state=RANDOM_SEED)
X_pca_reduced = pca_reduced.fit_transform(X_scaled)
X_pca_reduced.shape
Out[ ]:
(4213, 11)

Model Selection and Evaluation¶

Using holdout validation, the dataset is divided into training and testing sets with an 80-20 split: 80% of the data is used for training the models, and 20% is held out for testing.

In [ ]:
from sklearn.model_selection import train_test_split
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(
    X_pca_reduced, y, test_size=0.2, random_state=RANDOM_SEED
)
X_train.shape, X_test.shape
Out[ ]:
((3370, 11), (843, 11))

Model selection¶

Different models are considered for this regression task:

  1. Linear Regression
  2. Ridge Regression
  3. Lasso Regression
  4. Elastic Net Regression
  5. Random Forest Regression
  6. Support Vector Regression
  7. Gradient Boosting Regression
  8. K-Nearest Neighbors Regression
  9. Decision Tree Regression
  10. Neural Network Regression
  11. Stochastic Gradient Descent Regression
  12. AdaBoost Regression
In [ ]:
from sklearn.linear_model import (
    LinearRegression,
    Ridge,
    ElasticNet,
    Lasso,
    SGDRegressor,
)
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    mean_absolute_percentage_error,
)

Define the models with default parameters

In [ ]:
models = {
    "Linear Regression": LinearRegression(),
    "Ridge": Ridge(random_state=RANDOM_SEED),
    "Lasso": Lasso(random_state=RANDOM_SEED),
    "Elastic Net": ElasticNet(random_state=RANDOM_SEED),
    "Random Forest": RandomForestRegressor(random_state=RANDOM_SEED),
    "Support Vector Regressor": SVR(),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=RANDOM_SEED),
    "K-Nearest Neighbors": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Neural Network": MLPRegressor(random_state=RANDOM_SEED),
    "Stochastic Gradient Descent": SGDRegressor(random_state=RANDOM_SEED),
    "Adaboost Regressor": AdaBoostRegressor(random_state=RANDOM_SEED),
}

Define the regression metrics to evaluate the models

In [ ]:
def regression_metrics(y_true, y_pred):
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    mape = mean_absolute_percentage_error(y_true, y_pred)
    return {
        "Mean Absolute Error": mae,
        "Mean Squared Error": mse,
        "Root Mean Squared Error": rmse,
        "R2 Score": r2,
        "Mean Absolute Percentage Error": mape,
        "Mean Bias Error": np.mean(y_pred - y_true),
    }

Train the models and evaluate performance on the test set

In [ ]:
regression_results = {}
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    regression_results[name] = regression_metrics(y_test, y_pred)

regression_results_df = pd.DataFrame(regression_results).T
regression_results_df
Training Linear Regression...
Training Ridge...
Training Lasso...
Training Elastic Net...
Training Random Forest...
Training Support Vector Regressor...
Training Gradient Boosting Regressor...
Training K-Nearest Neighbors...
Training Decision Tree...
Training Neural Network...
/opt/homebrew/lib/python3.11/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Training Stochastic Gradient Descent...
Training Adaboost Regressor...
Out[ ]:
Mean Absolute Error Mean Squared Error Root Mean Squared Error R2 Score Mean Absolute Percentage Error Mean Bias Error
Linear Regression 399.579896 267512.251601 517.215866 0.706040 482.908191 -28.937825
Ridge 399.597218 267515.297004 517.218810 0.706036 482.901372 -28.945068
Lasso 400.000643 267604.829700 517.305354 0.705938 482.911397 -29.218607
Elastic Net 457.783607 310797.582356 557.492226 0.658475 636.834457 -36.073312
Random Forest 325.025954 209870.702463 458.116473 0.769380 185.637110 -15.090408
Support Vector Regressor 765.358677 770179.800557 877.598884 0.153675 1148.574269 -218.317426
Gradient Boosting Regressor 358.759991 242029.866308 491.965310 0.734042 267.367222 -25.521922
K-Nearest Neighbors 328.165821 225101.558087 474.448689 0.752643 455.182658 0.885740
Decision Tree 422.534791 407765.408259 638.565117 0.551920 471.186008 -12.077871
Neural Network 369.533026 262163.305875 512.018853 0.711918 211.139910 -68.978097
Stochastic Gradient Descent 399.704186 266651.370733 516.382969 0.706986 504.922495 -31.717054
Adaboost Regressor 583.034156 426736.188421 653.250479 0.531074 749.475476 31.755897

Selecting the best model¶

Select the best model based on their performance on the test set

In [ ]:
# sort the results by R2 score in descending order
regression_results_df.sort_values(by="R2 Score", ascending=False)
Out[ ]:
| Model | Mean Absolute Error | Mean Squared Error | Root Mean Squared Error | R2 Score | Mean Absolute Percentage Error | Mean Bias Error |
|---|---|---|---|---|---|---|
| Random Forest | 325.025954 | 209870.702463 | 458.116473 | 0.769380 | 185.637110 | -15.090408 |
| K-Nearest Neighbors | 328.165821 | 225101.558087 | 474.448689 | 0.752643 | 455.182658 | 0.885740 |
| Gradient Boosting Regressor | 358.759991 | 242029.866308 | 491.965310 | 0.734042 | 267.367222 | -25.521922 |
| Neural Network | 369.533026 | 262163.305875 | 512.018853 | 0.711918 | 211.139910 | -68.978097 |
| Stochastic Gradient Descent | 399.704186 | 266651.370733 | 516.382969 | 0.706986 | 504.922495 | -31.717054 |
| Linear Regression | 399.579896 | 267512.251601 | 517.215866 | 0.706040 | 482.908191 | -28.937825 |
| Ridge | 399.597218 | 267515.297004 | 517.218810 | 0.706036 | 482.901372 | -28.945068 |
| Lasso | 400.000643 | 267604.829700 | 517.305354 | 0.705938 | 482.911397 | -29.218607 |
| Elastic Net | 457.783607 | 310797.582356 | 557.492226 | 0.658475 | 636.834457 | -36.073312 |
| Decision Tree | 422.534791 | 407765.408259 | 638.565117 | 0.551920 | 471.186008 | -12.077871 |
| Adaboost Regressor | 583.034156 | 426736.188421 | 653.250479 | 0.531074 | 749.475476 | 31.755897 |
| Support Vector Regressor | 765.358677 | 770179.800557 | 877.598884 | 0.153675 | 1148.574269 | -218.317426 |

Using the default parameters, the best model is the Random Forest regressor, with an RMSE of about 458 and an R² score of 0.769. The worst model is the Support Vector Regressor, with an RMSE of about 878 and an R² score of 0.154.

We will proceed with the Random Forest Regression model and tune its hyperparameters to improve its performance.

Hyperparameter Tuning for the Random Forest Regression Model¶

Using Grid Search to find the best hyperparameters

In [ ]:
from sklearn.model_selection import GridSearchCV

Define the hyperparameters to tune

In [ ]:
param_grid = {
    "n_estimators": [200, 300],  # number of trees
    "max_depth": [10, None],  # maximum depth of each tree
    "min_samples_split": [2, 5],  # minimum number of samples required to split a node
    "min_samples_leaf": [1, 2],  # minimum number of samples required at each leaf node
    "max_features": ["sqrt", "log2"],  # number of features to consider at each split
    "bootstrap": [True, False],  # method of selecting samples for training each tree
    "random_state": [RANDOM_SEED],  # random seed
}
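As a sanity check, the number of candidate settings this grid produces can be counted with scikit-learn's `ParameterGrid` (a minimal sketch; `43` below stands in for `RANDOM_SEED`, which the CV logs show is 43):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above: 2 * 2 * 2 * 2 * 2 * 2 * 1 = 64 candidate settings
param_grid = {
    "n_estimators": [200, 300],
    "max_depth": [10, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
    "random_state": [43],  # stand-in for RANDOM_SEED
}

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 64 candidates; with 3-fold CV, 64 * 3 = 192 fits
```

This matches the "64 candidates, totalling 192 fits" message printed by `GridSearchCV(verbose=2)` below.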
In [ ]:
rf = RandomForestRegressor()

Using 3-Fold Cross Validation to evaluate the model with different hyperparameter values

In [ ]:
# Instantiate the grid search model
grid_search = GridSearchCV(
    estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2
)
# Fit the grid search to the data
scores = grid_search.fit(X_train, y_train)
Fitting 3 folds for each of 64 candidates, totalling 192 fits
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200, random_state=43; total time=   1.1s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=200, random_state=43; total time=   1.0s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=200, random_state=43; total time=   0.8s
...

The best hyperparameters for the Random Forest Regression model are:

In [ ]:
grid_search.best_params_
Out[ ]:
{'bootstrap': False,
 'max_depth': None,
 'max_features': 'sqrt',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 300,
 'random_state': 43}
In [ ]:
# Using the best estimator that is trained using 3-fold cross validation
regression_metrics(y_test, grid_search.best_estimator_.predict(X_test))
Out[ ]:
{'Mean Absolute Error': 320.4081448458532,
 'Mean Squared Error': 200021.18391009144,
 'Root Mean Squared Error': 447.2372792043296,
 'R2 Score': 0.7802034371098439,
 'Mean Absolute Percentage Error': 337.93869897545613,
 'Mean Bias Error': -13.199074736835223}
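The regression_metrics helper used throughout is defined earlier in the notebook. A minimal sketch that reproduces the same dictionary keys (the formulas here are inferred from the metric names, not taken from the original implementation) could be:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics_sketch(y_true, y_pred):
    """Inferred sketch of the notebook's regression_metrics helper."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    nonzero = y_true != 0  # MAPE is undefined where y_true == 0
    return {
        "Mean Absolute Error": mean_absolute_error(y_true, y_pred),
        "Mean Squared Error": mse,
        "Root Mean Squared Error": float(np.sqrt(mse)),
        "R2 Score": r2_score(y_true, y_pred),
        "Mean Absolute Percentage Error": float(
            np.mean(np.abs((y_true - y_pred)[nonzero] / y_true[nonzero])) * 100
        ),
        # Positive MBE = overestimation, negative = underestimation
        "Mean Bias Error": float(np.mean(y_pred - y_true)),
    }

regression_metrics_sketch([1.0, 2.0, 4.0], [1.5, 2.0, 3.5])
```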
In [ ]:
from sklearn.pipeline import make_pipeline
In [ ]:
# A function to train a model and evaluate it
def model_it(X, y, model, dim_reduce=None, random_state=RANDOM_SEED, test_size=0.2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    steps = [StandardScaler()]
    if dim_reduce:
        steps.append(dim_reduce)
    steps.append(model)
    pipeline = make_pipeline(*steps)
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    return pipeline, regression_metrics(y_test, y_pred), (y_test, y_pred)


# Using the best estimator parameters, train the model and evaluate it
(
    pipeline,
    metrics,
    _,
) = model_it(
    X,
    y,
    RandomForestRegressor(**grid_search.best_params_),
    dim_reduce=PCA(n_components=11, random_state=RANDOM_SEED),
    random_state=RANDOM_SEED,
)
pipeline, metrics
Out[ ]:
(Pipeline(steps=[('standardscaler', StandardScaler()),
                 ('pca', PCA(n_components=11, random_state=43)),
                 ('randomforestregressor',
                  RandomForestRegressor(bootstrap=False, max_features='sqrt',
                                        n_estimators=300, random_state=43))]),
 {'Mean Absolute Error': 320.06003789818317,
  'Mean Squared Error': 198856.6440063794,
  'Root Mean Squared Error': 445.9334524414819,
  'R2 Score': 0.781483110908292,
  'Mean Absolute Percentage Error': 336.55505317985006,
  'Mean Bias Error': -10.603309197779827})

Trying out models without feature reduction¶

In [ ]:
no_dim_reduce_regression_results = {}
for name, model in models.items():
    print(f"Modeling {name}...")
    pipeline, metrics, _ = model_it(
        X, y, model, dim_reduce=None, random_state=RANDOM_SEED
    )
    no_dim_reduce_regression_results[name] = metrics

no_dim_reduce_regression_results_df = pd.DataFrame(no_dim_reduce_regression_results).T
no_dim_reduce_regression_results_df.sort_values(by="R2 Score", ascending=False)
Modeling Linear Regression...
Modeling Ridge...
Modeling Lasso...
Modeling Elastic Net...
Modeling Random Forest...
Modeling Support Vector Regressor...
Modeling Gradient Boosting Regressor...
Modeling K-Nearest Neighbors...
Modeling Decision Tree...
Modeling Neural Network...
/opt/homebrew/lib/python3.11/site-packages/sklearn/neural_network/_multilayer_perceptron.py:691: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
  warnings.warn(
Modeling Stochastic Gradient Descent...
Modeling Adaboost Regressor...
Out[ ]:
| Model | Mean Absolute Error | Mean Squared Error | Root Mean Squared Error | R2 Score | Mean Absolute Percentage Error | Mean Bias Error |
|---|---|---|---|---|---|---|
| Random Forest | 256.803865 | 167480.645723 | 409.243993 | 0.815961 | 374.045361 | -10.243749 |
| Gradient Boosting Regressor | 288.253586 | 181831.952450 | 426.417580 | 0.800191 | 442.914880 | -10.369109 |
| K-Nearest Neighbors | 321.737687 | 222159.164973 | 471.337634 | 0.755877 | 459.336895 | 10.986202 |
| Stochastic Gradient Descent | 390.417655 | 258539.957997 | 508.468247 | 0.715899 | 392.405691 | -29.985635 |
| Lasso | 391.941881 | 259922.668352 | 509.826116 | 0.714380 | 373.129578 | -28.181217 |
| Ridge | 391.667577 | 260409.720046 | 510.303557 | 0.713845 | 378.620236 | -28.480979 |
| Linear Regression | 391.696464 | 260518.340859 | 510.409973 | 0.713725 | 379.165057 | -28.510247 |
| Neural Network | 375.809152 | 264079.350993 | 513.886516 | 0.709812 | 217.355974 | -71.569474 |
| Adaboost Regressor | 450.733503 | 297565.555569 | 545.495697 | 0.673015 | 589.670260 | -43.050280 |
| Elastic Net | 455.562872 | 307948.782428 | 554.931331 | 0.661605 | 604.902121 | -36.053680 |
| Decision Tree | 324.741653 | 324060.063400 | 569.262737 | 0.643901 | 419.247085 | -2.952272 |
| Support Vector Regressor | 769.850677 | 779939.486008 | 883.141827 | 0.142951 | 1146.575229 | -221.312917 |

Trying Random Forest regression with the best hyperparameters, without feature reduction

In [ ]:
no_dim_reduce_rf_pipeline, no_dim_reduce_rf_metrics, no_dim_reduce_rf_ys = model_it(
    X,
    y,
    RandomForestRegressor(**grid_search.best_params_),
    dim_reduce=None,
    random_state=RANDOM_SEED,
)

no_dim_reduce_rf_pipeline
Out[ ]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('randomforestregressor',
                 RandomForestRegressor(bootstrap=False, max_features='sqrt',
                                       n_estimators=300, random_state=43))])
In [ ]:
no_dim_reduce_rf_model = no_dim_reduce_rf_pipeline.named_steps["randomforestregressor"]
no_dim_reduce_rf_model
Out[ ]:
RandomForestRegressor(bootstrap=False, max_features='sqrt', n_estimators=300,
                      random_state=43)
In [ ]:
no_dim_reduce_rf_metrics
Out[ ]:
{'Mean Absolute Error': 256.551089927555,
 'Mean Squared Error': 155789.53863814945,
 'Root Mean Squared Error': 394.7018351086671,
 'R2 Score': 0.8288081069338133,
 'Mean Absolute Percentage Error': 398.1254764493029,
 'Mean Bias Error': -14.762168127768302}

The model performs slightly better without feature reduction, achieving an RMSE of about 395 and an R² score of 0.829. This is expected, as Random Forests handle multicollinearity and feature interactions well.

Trying out Ensemble learning using stacking¶

In [ ]:
from sklearn.ensemble import StackingRegressor
In [ ]:
# define the base models
base_regressors = [
    ('svr', SVR()),
    ('ada', AdaBoostRegressor(random_state=RANDOM_SEED)),
    ('knn', KNeighborsRegressor()),
    ('rf', RandomForestRegressor(**grid_search.best_params_)),
]

# define the stacking ensemble
stacked_regressor = StackingRegressor(
    estimators=base_regressors,
    final_estimator=LinearRegression(),
    verbose=2,
)

stacked_pipeline, stacked_metrics, stacked_ys = model_it(
    X,
    y,
    stacked_regressor,
    dim_reduce=None,
    random_state=RANDOM_SEED,
)
stacked_pipeline, stacked_metrics
Out[ ]:
(Pipeline(steps=[('standardscaler', StandardScaler()),
                 ('stackingregressor',
                  StackingRegressor(estimators=[('svr', SVR()),
                                                ('ada',
                                                 AdaBoostRegressor(random_state=43)),
                                                ('knn', KNeighborsRegressor()),
                                                ('rf',
                                                 RandomForestRegressor(bootstrap=False,
                                                                       max_features='sqrt',
                                                                       n_estimators=300,
                                                                       random_state=43))],
                                    final_estimator=LinearRegression(),
                                    verbose=2))]),
 {'Mean Absolute Error': 247.22433880842823,
  'Mean Squared Error': 149659.85233279874,
  'Root Mean Squared Error': 386.85895664027055,
  'R2 Score': 0.8355438133983674,
  'Mean Absolute Percentage Error': 365.5321772342633,
  'Mean Bias Error': -13.285098460772305})

The Final Model¶

In [ ]:
final_model = stacked_pipeline
final_model
Out[ ]:
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('stackingregressor',
                 StackingRegressor(estimators=[('svr', SVR()),
                                               ('ada',
                                                AdaBoostRegressor(random_state=43)),
                                               ('knn', KNeighborsRegressor()),
                                               ('rf',
                                                RandomForestRegressor(bootstrap=False,
                                                                      max_features='sqrt',
                                                                      n_estimators=300,
                                                                      random_state=43))],
                                   final_estimator=LinearRegression(),
                                   verbose=2))])
In [ ]:
y_test, y_pred = stacked_ys
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.title("Actual vs Predicted Values")
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.plot(
    [y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--", lw=2
)  # Diagonal line
plt.show()
[Figure: scatter plot of actual vs. predicted values, with the y = x diagonal shown as a dashed line]

The final model is a stacked regressor: an ensemble learning technique that combines multiple base regressors with a final meta-estimator.

The base regressors include a Support Vector Regressor (SVR), an AdaBoostRegressor with a random state for reproducibility, a K-Nearest Neighbors Regressor (KNN), and a RandomForestRegressor whose parameters are optimized using grid search. These diverse models are trained independently to capture different patterns in the data. The predictions from these base models are then used as input to a final estimator, which is a Linear Regression model. This final model learns how to optimally combine the predictions of the base models to produce a more accurate and robust final prediction.
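Conceptually, stacking trains the meta-learner on out-of-fold predictions from the base models, so the final estimator never sees a base model's in-sample fit. A simplified, self-contained sketch of that mechanism on synthetic data (two base models instead of four, for brevity):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Toy data standing in for the scaled meteorological features.
rng = np.random.default_rng(43)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

base_models = [
    KNeighborsRegressor(),
    RandomForestRegressor(n_estimators=50, random_state=43),
]

# Out-of-fold predictions from each base model become the meta-features,
# which is what StackingRegressor does internally via cross-validation.
meta_features = np.column_stack(
    [cross_val_predict(m, X_demo, y_demo, cv=5) for m in base_models]
)
meta_learner = LinearRegression().fit(meta_features, y_demo)
print(meta_learner.coef_)  # weights the meta-learner assigns to each base model
```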

The model was trained on an 80-20 train-test split with the following configuration:

  • Model Configuration:
    • Stacking Regressor:

      • Base Regressors:

        • Support Vector Regressor (SVR): A regression algorithm based on support vector machine theory.
        • AdaBoostRegressor: An ensemble method using boosting, focusing on difficult cases in successive iterations, and initialized with a fixed random state for reproducibility.
        • K-Nearest Neighbors Regressor (KNN): An instance-based learning method predicting responses by interpolating the targets of nearest neighbors.
        • Random Forest Regressor: An ensemble of decision trees, optimized using parameters determined from a previous grid search.
          • Parameters: bootstrap=False, max_features='sqrt', n_estimators=300, random_state=43
      • Final Estimator:

        • Linear Regression
    • Data Preprocessing: Standard Scaling applied to features
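For reuse outside the notebook, the fitted pipeline could be persisted to disk; a minimal sketch using joblib (the filename is illustrative, and a small stand-in pipeline is fitted here so the example is self-contained):

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for the final stacked pipeline.
rng = np.random.default_rng(43)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5])
demo_model = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)

# Serialize the whole pipeline (scaler + model) and load it back.
joblib.dump(demo_model, "solar_model_demo.joblib")
reloaded = joblib.load("solar_model_demo.joblib")
print(np.allclose(reloaded.predict(X_demo), demo_model.predict(X_demo)))
```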

Performance Metrics Analysis¶

  • Mean Absolute Error (MAE) - 247.22 kW:

    • The model, on average, errs by about 247.22 kW. This metric indicates the average absolute deviation of the model predictions from the actual values.
  • Mean Squared Error (MSE) - 149659.85 kW²:

    • The MSE, representing the average squared differences between predicted and actual values, is 149659.85 kW². This high value suggests that the model may have instances of significant errors in its predictions.
  • Root Mean Squared Error (RMSE) - 386.86 kW:

    • The RMSE, the square root of the MSE, indicates that typical prediction errors are about 386.86 kW. It expresses the error magnitude in the same unit as the target variable (kW).
  • R2 Score - 0.8355:

    • An R2 Score of 0.8355 indicates that approximately 83.55% of the variance in solar PV output is predictable by the model. This high R² value points to a strong fit to the data.
  • Mean Absolute Percentage Error (MAPE) - 365.53%:

    • The high MAPE suggests that there are instances where the model's predictions are significantly off from the actual values. This could be more pronounced in cases with lower solar PV outputs.
  • Mean Bias Error (MBE) - -13.285 kW:

    • The negative MBE indicates a slight tendency of the model to underestimate the solar PV output, with an average underestimation of around 13.285 kW.
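The inflated MAPE is largely an artifact of near-zero actual values, where even a small absolute error yields an enormous percentage error. A tiny illustration:

```python
import numpy as np

# One near-zero actual value and one large one.
y_true = np.array([2.0, 1000.0])
y_pred = np.array([10.0, 990.0])

# Absolute percentage error per point: a modest 8 kW miss on a 2 kW actual
# contributes 400%, dwarfing the 1% error on the 1000 kW point.
ape = np.abs((y_true - y_pred) / y_true) * 100
print(ape)         # → [400.   1.]
print(ape.mean())  # → 200.5
```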

The model demonstrates strong predictive capabilities, particularly highlighted by the R2 Score. However, the sizable MAE and MSE, and especially the high MAPE, point to potential accuracy issues in scenarios of lower solar PV output. The negative MBE suggests a consistent underestimation trend.

Feature Importance Analysis (based on RandomForestRegressor)¶

In [ ]:
importances = no_dim_reduce_rf_model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(12, 6))
plt.title("Feature Importances in RandomForest Regressor")
plt.bar(range(X.shape[1]), importances[indices], align="center")
plt.xticks(range(X.shape[1]), [X.columns[i] for i in indices], rotation=90)
plt.xlabel("Feature")
plt.ylabel("Importance")
plt.show()
[Figure: bar chart of feature importances from the RandomForest Regressor, sorted in descending order]
In [ ]:
# display as table
pd.DataFrame(
    {
        "Feature": [X.columns[i] for i in indices],
        "Importance": importances[indices],
    }
)
Out[ ]:
| # | Feature | Importance |
|---|---------|------------|
| 0 | zenith | 0.202861 |
| 1 | angle_of_incidence | 0.199005 |
| 2 | azimuth | 0.124835 |
| 3 | shortwave_radiation_backwards_sfc | 0.123308 |
| 4 | total_cloud_cover_sfc | 0.050858 |
| 5 | relative_humidity_2_m_above_gnd | 0.048323 |
| 6 | mean_sea_level_pressure_MSL | 0.034634 |
| 7 | low_cloud_cover_low_cld_lay | 0.030489 |
| 8 | temperature_2_m_above_gnd | 0.027970 |
| 9 | medium_cloud_cover_mid_cld_lay | 0.020056 |
| 10 | wind_gust_10_m_above_gnd | 0.019489 |
| 11 | wind_speed_80_m_above_gnd | 0.017278 |
| 12 | wind_speed_900_mb | 0.016905 |
| 13 | wind_speed_10_m_above_gnd | 0.016809 |
| 14 | wind_direction_10_m_above_gnd | 0.016469 |
| 15 | wind_direction_900_mb | 0.016216 |
| 16 | wind_direction_80_m_above_gnd | 0.015625 |
| 17 | high_cloud_cover_high_cld_lay | 0.011792 |
| 18 | total_precipitation_sfc | 0.006796 |
| 19 | snowfall_amount_sfc | 0.000281 |

The feature importance results from RandomForest Regressor provide valuable insights into which features are most influential in predicting the target variable in the dataset. Here's an analysis of the results:

Top Features¶

  • Zenith (20.29%): The most important feature. The solar zenith angle has the highest impact on the model's predictions, suggesting that the position of the sun in the sky is crucial in determining solar power generation.
  • Angle of Incidence (19.90%): The second most important feature. It captures the angle at which sunlight strikes the solar panels, which affects their efficiency and, consequently, power generation.

Other Significant Features¶

  • Azimuth (12.48%) and Shortwave Radiation Backwards SFC (12.33%): Both are related to the position and intensity of sunlight, emphasizing the importance of solar irradiance and panel orientation in solar power generation.
  • Relative Humidity and Total Cloud Cover: These atmospheric conditions significantly affect solar panel efficiency by influencing the amount of solar radiation that reaches the panels.

Lower Impact Features¶

  • Weather-related features like Mean Sea Level Pressure, Low Cloud Cover, and Temperature have moderate importance, indirectly affecting solar power generation by influencing the local climate and weather patterns.
  • Wind-related Features: Wind speed and direction at different levels have lower importance, indicating that they are less critical for power generation compared to solar irradiance and panel orientation.

Least Important Features¶

  • Snowfall Amount SFC: Has the least importance, possibly due to its minimal impact on solar power generation in most scenarios or due to the rarity of snowfall events in the dataset.

Summary¶

  • The analysis highlights the critical role of solar irradiance (zenith, angle of incidence, azimuth, shortwave radiation) in solar power generation.
  • Atmospheric conditions (humidity, cloud cover) play a significant role but are secondary to direct solar parameters.
  • The relatively lower importance of temperature and wind-related features suggests that these factors are less critical in predicting solar power generation compared to direct sunlight-related features.
  • These insights can guide more targeted data collection, feature engineering, and model refinement efforts in the field of solar energy forecasting.
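Impurity-based importances from tree ensembles can be biased toward features with many distinct values; permutation importance on held-out data is a common cross-check. A self-contained sketch of the idea (in the notebook, the fitted model and the actual X_test/y_test split would be passed instead of the toy data):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Toy data where only feature 0 drives the target.
rng = np.random.default_rng(43)
X_demo = rng.normal(size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=43)
model = RandomForestRegressor(n_estimators=100, random_state=43).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=43)
print(result.importances_mean)  # feature 0 should dominate
```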